Category-based Statistical Language Models Synopsis

Author

  • Thomas Niesler
Abstract

Language models are computational techniques and structures that describe word sequences produced by human subjects, and the work presented here considers primarily their application to automatic speech-recognition systems. Owing to the very complex nature of natural languages and the need for robust recognition, statistically based language models, which assign probabilities to word sequences, have proved most successful. This thesis focuses on the use of linguistically defined word categories as a means of improving the performance of statistical language models. In particular, an approach that uses different model components to capture both general grammatical patterns and particular word dependencies is proposed, developed and evaluated. To account for grammatical patterns, a model employing variable-length n-grams of part-of-speech word categories is developed. The often local syntactic patterns in English text are captured conveniently by the n-gram structure, and the reduced sparseness of the data allows larger n to be employed. A technique that optimises the length of individual n-grams is proposed, and experimental tests show that it leads to improved results. The model allows words to belong to multiple categories in order to cater for different grammatical functions, and may be employed as a tagger to assign category classifications to new text. While the category-based model has the important advantage of generalisation to unseen word sequences, it is by nature not able to capture relationships between particular words. An experimental comparison with word-based n-gram approaches reveals this ability to be important to language model quality, and consequently two methods allowing the inclusion of word relations are developed. The first method allows the incorporation of selected word n-grams within a backoff framework.
The number of word n-grams added may be controlled, and the resulting tradeoff between size and accuracy is shown to surpass that of standard techniques based on n-gram cutoffs. The second technique addresses longer-range word-pair relationships that arise due to factors such as the topic or the style of the text. Empirical evidence is presented demonstrating an approximately exponentially decaying behaviour when considering the probabilities of related words as a function of an appropriately defined separating distance. This definition, which is fundamental to the approach, is made in terms of the category assignments of the words. It minimises the effect syntax has on word co-occurrences while taking particular advantage of the grammatical word classifications implicit in the operation of the category model. Since only related words are treated, …
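As a rough illustration of the category-based idea described in the abstract, the following Python sketch trains a fixed-length category bigram model on a tiny hand-tagged corpus and scores a word sequence by summing over all category sequences. It is a simplified stand-in, not the thesis's method: the thesis uses variable-length category n-grams, whereas this sketch assumes plain bigrams, unsmoothed relative-frequency estimates, and a toy corpus. The names `TAGGED`, `train`, and `sentence_probability` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Toy hand-tagged corpus (hypothetical data, for illustration only).
# Note that "runs" appears both as a NOUN and as a VERB, so it belongs
# to more than one category.
TAGGED = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("runs", "NOUN"), ("end", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
]


def train(tagged_sentences):
    """Estimate P(category | previous category) and P(word | category)
    by unsmoothed relative frequency (an assumption of this sketch)."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, cat in sentence:
            trans[prev][cat] += 1
            emit[cat][word] += 1
            prev = cat

    def normalise(table):
        result = {}
        for key, row in table.items():
            total = sum(row.values())
            result[key] = {item: count / total for item, count in row.items()}
        return result

    return normalise(trans), normalise(emit)


def sentence_probability(words, trans, emit):
    """Probability of a word sequence, summed over every category
    sequence that could have produced it (forward recursion)."""
    alpha = {"<s>": 1.0}
    for word in words:
        nxt = defaultdict(float)
        for prev, p_prev in alpha.items():
            for cat, p_trans in trans.get(prev, {}).items():
                p_emit = emit.get(cat, {}).get(word, 0.0)
                if p_emit > 0.0:
                    nxt[cat] += p_prev * p_trans * p_emit
        alpha = dict(nxt)
    return sum(alpha.values())


trans, emit = train(TAGGED)
# "a dog runs" never occurs in the toy corpus, yet it receives non-zero
# probability because the category pattern DET NOUN VERB has been seen.
print(sentence_probability(["a", "dog", "runs"], trans, emit))
```

The word sequence "a dog runs" is absent from the toy corpus, but the model still assigns it probability through the shared category pattern, which is the generalisation property the abstract refers to. Conversely, nothing in this sketch can prefer one particular word pair over another within the same categories, which is the limitation the word-based extensions are meant to address.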

Similar resources

THE RELATIVISTIC ENERGY-MOMENTUM TENSOR IN POLARIZED MEDIA VII. DISCUSSION OF THE RESULTS IN CONNEXION WITH PREVIOUS WORK *) by S. R. de GROOT and L. G. SUTTORP

synopsis The literature on the relativistic energy-momentum tensor in polarized media falls apart in treatments based on microscopic first principles and considerations starting from macroscopic postulates. Only papers of the first category, such as Lorentz’s and Einstein-Laub’s, can be considered as derivations, whereas treatments of the second category, such as Minkowski’s and Abraham’s, base...

Statistical and Linguistic Clustering for Language Modeling in ASR

In this work several sets of categories obtained by a statistical clustering algorithm, as well as a linguistic set, were used to design category-based language models. The language models proposed were evaluated, as usual, in terms of perplexity of the text corpus. Then they were integrated into an ASR system and also evaluated in terms of system performance. It can be seen that category-based...

Feature-based Sentiment Analysis on Android App Reviews Using SAS® Text Miner and SAS® Sentiment Analysis Studio

Sentiment analysis is a popular technique for summarizing and analyzing consumers’ textual reviews about products and services. There are two major approaches for performing sentiment analysis; statistical model based approaches and Natural Language Processing (NLP) based approaches to create rules. In this study, we first apply text mining to summarize users’ reviews of Android Apps and extrac...

An Axiomatization of Computationally Adequate Domain Theoretic Models of FPC

Synopsis Categorical models of the metalanguage FPC (a type theory with sums, products, exponentials and recursive types) are defined. Then, domain-theoretic models of FPC are axiomatised and a wide subclass of them (the non-trivial and absolute ones) are proved to be both computationally sound and adequate. Examples include: the category of cpos and partial continuous functions and functor cate...

Multi-label Text Categorization with Model Combination based on F1-score Maximization

Text categorization is a fundamental task in natural language processing, and is generally defined as a multi-label categorization problem, where each text document is assigned to one or more categories. We focus on providing good statistical classifiers with a generalization ability for multi-label categorization and present a classifier design method based on model combination and F1-score ma...

Publication date: 1997